
feat: Deepseek migration to Megatron-Bridge + CP support #1059

Merged
terrykong merged 30 commits into main from yifu/mbridge_dsv3_mcoretot on Sep 11, 2025

Conversation

@yfw (Contributor) commented Sep 4, 2025

What does this PR do ?

This PR migrates DeepSeek support to Megatron-Bridge. It also adds CP (context parallelism) support for DeepSeek by updating to the latest mcore (taken from #1042).

Closes #1043
Closes #821

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
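The usage placeholder above was left unfilled. As a purely hypothetical sketch of what enabling the new features might look like: the key names below are inferred from the megatron_cfg fields discussed later in this thread (moe_permute_fusion and the *_model_parallel_size knobs appear in the reviewed configs; context_parallel_size is an assumption by analogy, so verify against the actual config schema before use).

```yaml
policy:
  megatron_cfg:
    enabled: true
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 4
    expert_model_parallel_size: 4
    # CP support added by this PR; key name assumed by analogy with the other sizes.
    context_parallel_size: 2
    # New global flag added by this PR (default: false).
    moe_permute_fusion: false
```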

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Summary by CodeRabbit

  • New Features

    • Added a global config flag to toggle MoE permutation fusion (default: false) across examples and tooling.
    • Enabled trusted remote-code support for model and tokenizer loading.
  • Chores

    • Updated third‑party submodules to newer revisions.
    • Added filelock to pinned dependencies.
  • Tests

    • Updated unit test configs to include the new MoE permutation fusion flag.

yfw added 8 commits September 2, 2025 18:29
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×8, one per commit)
@yfw yfw requested review from terrykong and yaoyu-33 September 4, 2025 00:33
@yfw yfw changed the title feat: CP support for deepseek feat: Support deepseek with Megatron-Bridge + CP support Sep 4, 2025
@yfw yfw changed the title feat: Support deepseek with Megatron-Bridge + CP support feat: Deepseek migration to Megatron-Bridge + CP support Sep 4, 2025
@terrykong (Collaborator) commented:

Since we're moving Megatron ahead, can you run all the Megatron nightly tests? I can share the baselines from the bridge PR for comparison

yfw added 3 commits September 3, 2025 23:35
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×3, one per commit)
yfw and others added 11 commits September 4, 2025 13:03
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com> (×7, one per commit)
logprob_mb_tokens should be an optional seq packing parameter (not needed in SFT)

first attempt to test run times

more timeout fixes

remove train_mv_tokens/logprob_mb_tokens since it's not needed in the init and causes crashes

Signed-off-by: Terry Kong <terryk@nvidia.com>

fix gemma27b parallel config

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert steps and time since that hides regression

Signed-off-by: Terry Kong <terryk@nvidia.com>

revert other configs that hide regression

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
github-actions bot commented Sep 6, 2025

✅ Submodule Fast-Forward Check Results

Check based on commit: d83ba06 (PR #1059 from yifu/mbridge_dsv3_mcoretot)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)
Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot commented Sep 9, 2025: same submodule fast-forward check passed for commit f2e21fc.

@coderabbitai (bot) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
nemo_rl/models/megatron/common.py (6)

36-44: Return annotation does not match the actual return (5-tuple)

Function returns five values but type hints specify four; also second element type is missing.

Apply:

-def _pack_sequences_for_megatron(
-    input_ids: torch.Tensor,
-    seq_lengths: torch.Tensor,
-    pad_individual_seqs_to_multiple_of: int = 1,
-    pad_packed_seq_to: Optional[int] = None,
-    cp_rank: int = 0,
-    cp_size: int = 1,
-) -> tuple[torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:
+def _pack_sequences_for_megatron(
+    input_ids: torch.Tensor,
+    seq_lengths: torch.Tensor,
+    pad_individual_seqs_to_multiple_of: int = 1,
+    pad_packed_seq_to: Optional[int] = None,
+    cp_rank: int = 0,
+    cp_size: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor, PackedSeqParams, torch.Tensor, Optional[torch.Tensor]]:

53-60: Docstring “Returns” order is wrong

Docstring says first is packed_input_ids, but code returns all_input_ids first. Fix to prevent misuse.

Apply:

-        Tuple of:
-        - packed_input_ids: Packed input tensor [1, T]
-        - input_ids_cp_sharded: Sharded input tensor [cp_size, T // cp_size]
+        Tuple of:
+        - all_input_ids: Concatenated (unsharded) input tensor [1, T_total]
+        - input_ids_cp_sharded: CP-sharded packed input tensor for this rank [1, T_shard]
         - packed_seq_params: PackedSeqParams object
         - cu_seqlens: Cumulative sequence lengths
         - cu_seqlens_padded: Padded cumulative sequence lengths

72-91: CP padding invariant not enforced (can break CP semantics)

When cp_size > 1, each sequence should be padded to multiple of cp_size*2 (per your Notes), but pad_factor is driven solely by pad_individual_seqs_to_multiple_of. This can cause misalignment between pack/unpack and TE kernels.

Apply:

-    pad_factor = pad_individual_seqs_to_multiple_of
+    pad_factor = pad_individual_seqs_to_multiple_of
+    if cp_size > 1 and pad_factor <= 1:
+        # Ensure per-seq padding meets CP requirements for causal attention load balancing
+        pad_factor = cp_size * 2

And consider documenting that callers can override via pad_individual_seqs_to_multiple_of.
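The invariant the comment describes can be shown as a standalone sketch. The function names below are illustrative only, not the actual helpers in nemo_rl/models/megatron/common.py:

```python
def cp_pad_factor(pad_individual_seqs_to_multiple_of: int, cp_size: int) -> int:
    """Per-sequence pad factor; CP causal-attention load balancing needs multiples of cp_size * 2."""
    pad_factor = pad_individual_seqs_to_multiple_of
    if cp_size > 1 and pad_factor <= 1:
        # Enforce the CP requirement when the caller did not request stricter padding.
        pad_factor = cp_size * 2
    return pad_factor


def round_up(seq_len: int, pad_factor: int) -> int:
    """Round a sequence length up to the next multiple of pad_factor (ceiling division)."""
    return -(-seq_len // pad_factor) * pad_factor
```

With cp_size=4 and no caller-requested padding, every sequence would be padded to a multiple of 8, so a length-10 sequence becomes length 16.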


121-129: Last-sequence padding can go negative

When pad_packed_seq_to < running_seq_len, padded_seq_len becomes negative and corrupts running_seq_len. Add a check.

Apply:

             if b == batch_size - 1 and pad_packed_seq_to is not None:
-                padded_seq_len = pad_packed_seq_to - running_seq_len
+                extra = pad_packed_seq_to - running_seq_len
+                if extra < 0:
+                    raise ValueError(
+                        f"pad_packed_seq_to ({pad_packed_seq_to}) smaller than accumulated length ({running_seq_len})"
+                    )
+                padded_seq_len = extra

355-361: Guard against temperature=0

Division by zero possible if config sets temperature: 0.

Apply:

-            output_tensor.div_(policy_cfg["generation"]["temperature"])
+            temp = policy_cfg["generation"].get("temperature", 1.0)
+            if temp and temp != 1.0:
+                output_tensor.div_(temp)

422-446: broadcast_tensor uses global ranks with group operations

dist.broadcast(_object_list|tensor) expects src to be the rank within the given group. Here, src_rank is documented and used as a global rank, which can target the wrong process.

Minimal fix: convert global src to group src before broadcasts, or change the API to accept src_group_rank. Example conversion:

-    if rank == src_rank:
+    group_rank = dist.get_rank(group)
+    # Convert global to group rank if supported; otherwise, require group-local rank
+    try:
+        src_group_rank = dist.get_group_rank(group, src_rank)  # PyTorch 2.3+
+    except Exception:
+        src_group_rank = src_rank  # Fallback: assume caller passed group-local rank
+
+    if group_rank == src_group_rank:
         if tensor is None:
             raise ValueError(f"Rank {rank} is source ({src_rank}) but tensor is None.")
@@
-        dist.broadcast_object_list(object_list, src=src_rank, group=group)
+        dist.broadcast_object_list(object_list, src=src_group_rank, group=group)
@@
-    dist.broadcast(tensor, src=src_rank, group=group)
+    dist.broadcast(tensor, src=src_group_rank, group=group)

Also update the docstring to clarify src is group-local.
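The global-vs-group distinction can be modeled without torch.distributed: a process group is an ordered list of global ranks, and the group-local rank is the index into that list. A toy model of the mapping (not the real dist.get_group_rank, whose signature takes a ProcessGroup):

```python
def global_to_group_rank(group_global_ranks: list[int], src_global_rank: int) -> int:
    """Map a global rank to its group-local rank (its index in the group's member list)."""
    try:
        return group_global_ranks.index(src_global_rank)
    except ValueError:
        raise ValueError(
            f"Global rank {src_global_rank} is not a member of group {group_global_ranks}"
        ) from None
```

For a group built from global ranks [2, 5, 7], global rank 5 is group rank 1; passing 5 as `src` to a group-scoped collective that expects group-local ranks would instead target global rank 7.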

tools/refit_verifier.py (1)

357-363: Fix tokenizer call: padding_side is not a valid encode kwarg

Set tokenizer.padding_side before calling, instead of passing as an argument.

Apply:

-    tokenized = tokenizer(
-        [prompt],
-        padding=True,
-        truncation=True,
-        return_tensors="pt",
-        padding_side="right",
-    )
+    tokenizer.padding_side = "right"
+    tokenized = tokenizer(
+        [prompt],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
nemo_rl/models/policy/megatron_policy_worker.py (1)

772-779: Gate trust_remote_code for tokenizer and AutoBridge

Same concern: make this configurable rather than always True.

Apply:

-        self.megatron_tokenizer = build_tokenizer(
+        self.megatron_tokenizer = build_tokenizer(
             tokenizer_config,
             make_vocab_size_divisible_by=self.megatron_cfg.model.make_vocab_size_divisible_by
             // self.cfg["megatron_cfg"]["tensor_model_parallel_size"],
             tensor_model_parallel_size=self.cfg["megatron_cfg"][
                 "tensor_model_parallel_size"
             ],
-            trust_remote_code=True,
+            trust_remote_code=self.cfg.get("trust_remote_code", False),
         )
@@
-        self.megatron_bridge = AutoBridge.from_hf_pretrained(
-            hf_model_name, trust_remote_code=True
-        )
+        self.megatron_bridge = AutoBridge.from_hf_pretrained(
+            hf_model_name, trust_remote_code=self.cfg.get("trust_remote_code", False)
+        )
🧹 Nitpick comments (14)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (2)

65-66: Nit: normalize boolean casing.

This file mixes True/False and true/false. Consider standardizing (e.g., all lower-case) for consistency across configs.


48-76: Optional: centralize repeated MoE defaults.

Given this flag is added across many configs, consider a shared partial (e.g., examples/configs/_partials/megatron_moe_defaults.yaml) and reference via defaults to reduce drift.

examples/configs/grpo_math_1B_megatron.yaml (2)

90-95: Optional: clarify comment scope and doc.

The “~20% perf speedup with sequence packing” comment sits above apply_rope_fusion. If that claim doesn’t pertain to moe_permute_fusion, consider a brief note for moe_permute_fusion (e.g., when to enable, known hardware/TE constraints).


74-136: Optional: deduplicate Megatron MoE block across recipes.

Same suggestion as in the other file—extract shared megatron_cfg MoE defaults to a partial to keep one source of truth (freeze_moe_router, router dtype/type, moe_permute_fusion, etc.).

3rdparty/Megatron-LM-workspace/Megatron-LM (1)

1-1: Prefer pinning to a stable tag or documenting a lockfile mapping; add a compat note.

Current pin is a raw SHA from a feature branch lineage. Consider:

  • Pinning to an annotated tag (if available) or
  • Keeping the SHA but recording it in a third_party.lock with a short “compat matrix” (Megatron-LM SHA ↔ Megatron-Bridge SHA ↔ TE version) to reduce drift and ease future rebases.
examples/configs/dpo.yaml (1)

112-112: Add a short doc comment and keep style consistent for booleans.

  • Please add a one-liner above to explain when to enable moe_permute_fusion (HW/TE/Megatron-Bridge requirements, known limitations).
  • Minor: booleans in this block mix True/true/false. Consider normalizing within megatron_cfg for readability.

Proposed tweak:

-    moe_permute_fusion: false
+    # Enables fused token<->expert permutation kernels for MoE (requires recent Megatron-Bridge/TE).
+    moe_permute_fusion: false
examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1)

76-76: Confirm interaction with freeze_moe_router=true and EP=4.

With expert_model_parallel_size: 4 and freeze_moe_router: true, fused permute should be safe but offers limited benefit if routing is static. Keep false by default unless nightly perf shows gains; add a note inline for when to flip.

Suggested comment:

-    moe_permute_fusion: false
+    # Set true if kernels are available; verify with GRPO nightly (EP=4, TP=4, PP=4).
+    moe_permute_fusion: false
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1)

74-74: FP8 path: gate fused permute on TE 2.6+ only.

This recipe targets FP8 vLLM training. Please ensure fused permute kernels are compatible with your TE version; otherwise leaving false is correct. Add a guard note to avoid accidental enablement.

-    moe_permute_fusion: false
+    # FP8: enable only with TE >= 2.6 and validated kernels.
+    moe_permute_fusion: false
examples/configs/grpo_math_1B.yaml (1)

77-77: Consistency + docs: add a brief note and align naming in docs/tools.

  • Add a brief inline comment on the flag purpose.
  • Ensure tools/refit_verifier.py and README mention moe_permute_fusion to keep configs, tooling, and docs in sync.
-    moe_permute_fusion: false
+    # Toggle fused MoE permutation kernels (perf). Default off for safety.
+    moe_permute_fusion: false
nemo_rl/models/megatron/common.py (1)

425-475: Device handling assumes CUDA; add CPU-safe fallback

torch.cuda.current_device() will fail on CPU-only runs. Use the source tensor’s device from metadata, or default to CPU when CUDA is unavailable.

Apply:

-    device = torch.cuda.current_device()
+    device = torch.device("cuda", torch.cuda.current_device()) if torch.cuda.is_available() else torch.device("cpu")

And prefer allocating tensors on the received device encoded in metadata.

tools/refit_verifier.py (3)

158-161: Align Megatron vs vLLM max_new_tokens

Megatron config sets max_new_tokens to args.max_sequence_length while vLLM uses args.max_new_tokens. This can skew comparisons. Recommend using the same source.

Apply:

-            "max_new_tokens": args.max_sequence_length,
+            "max_new_tokens": args.max_new_tokens,

279-281: Use boolean for enforce_eager

Minor: value is a string; prefer a boolean.

Apply:

-            "enforce_eager": "False",
+            "enforce_eager": False,

424-431: Avoid dumping large structures to stdout

The raw print of megatron_input_data and vllm_logprobs_data can be huge. Consider gating under a --verbose flag.

Also applies to: 425-431

nemo_rl/models/policy/megatron_policy_worker.py (1)

1169-1177: Prefer tokenizer ids over hard-coded zeros

Pass actual pad/eos token IDs to get_ltor_masks_and_position_ids for correctness across models.

Apply:

-                attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
-                    data=input_ids,
-                    eod_token=0,  # used for loss_mask, which we don't use
-                    pad_token=0,  # used for loss_mask, which we don't use
+                attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
+                    data=input_ids,
+                    eod_token=self.tokenizer.eos_token_id or 0,
+                    pad_token=self.tokenizer.pad_token_id or 0,
                     reset_position_ids=False,
                     reset_attention_mask=False,
                     eod_mask_loss=False,
                     pad_mask_loss=False,
                 )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c85276 and f2e21fc.

📒 Files selected for processing (24)
  • .gitmodules (1 hunks)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • 3rdparty/Megatron-LM-workspace/Megatron-LM (1 hunks)
  • examples/configs/dpo.yaml (1 hunks)
  • examples/configs/grpo_math_1B.yaml (1 hunks)
  • examples/configs/grpo_math_1B_megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1 hunks)
  • examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/grpo-qwen3-30ba3b-8n8g-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1 hunks)
  • examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1 hunks)
  • examples/configs/sft.yaml (1 hunks)
  • examples/configs/sft_openmathinstruct2_megatron.yaml (1 hunks)
  • nemo_rl/models/megatron/common.py (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (5 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (1 hunks)
  • tests/unit/models/policy/test_megatron_worker.py (1 hunks)
  • tools/refit_verifier.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint check
🔇 Additional comments (21)
examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatrontp2pp2-quick.yaml (1)

65-65: Confirm dense-path compatibility
Wiring of moe_permute_fusion is implemented in nemo_rl/models/policy/megatron_policy_worker.py (line 566). Please run E2E tests to ensure dense (non-MoE) code paths tolerate this flag without side effects.

examples/configs/grpo_math_1B_megatron.yaml (1)

92-92: Resolve moe_permute_fusion flag presence
The moe_permute_fusion: false line is already present (line 92), matching other configs and placed correctly after moe_router_bias_update_rate. No further changes needed.

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml (1)

55-55: Verify moe_permute_fusion plumbing with sequence_parallel=False and pipeline_model_parallel_size=2. Cannot confirm the flag is actually consumed by the worker in this topology—please ensure it’s passed through and that the corresponding kernels exist; if not, keep it false and document.

3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1)

1-1: Megatron-Bridge submodule bump verified—next steps

  • Verified that Megatron-Bridge is pinned to b99e1ca4e45d648fd8e2bffd63f5a41ede04c3da and Megatron-LM to 383d1144c3b3f77096c63b7308402a0ea6ba47dd, and .gitmodules correctly tracks the yifu/nemo-rl-use-chunkpatch-ds branch as shallow.
  • Manually confirm that this Megatron-Bridge commit remains API-compatible with the Megatron-LM bump.
  • Once PR 440 merges upstream, re-point .gitmodules to a stable branch/tag and pin to a released commit for reproducibility.
  • Ensure CI runs git submodule sync && git submodule update --init --recursive so nightlies and CP jobs use the exact pointers.
  • Add a brief note to RELEASE_NOTES or README explaining the temporary branch tracking and the plan for moving to an upstream tag.
examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron.yaml (1)

59-59: Adds moe_permute_fusion flag — OK

Flag wiring looks consistent with other configs and worker code. No issues.

examples/configs/sft.yaml (1)

89-89: Adds moe_permute_fusion flag — OK

Matches the repo-wide pattern; safe even with megatron_cfg.enabled: false.

examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.yaml (1)

65-65: Adds moe_permute_fusion flag — OK

Placement and default align with other configs.

examples/configs/recipes/llm/grpo-math-qwen3-30ba3b-megatron-tp4-32k.yaml (1)

80-81: Adds moe_permute_fusion flag — OK

Consistent with test and tooling updates.

examples/configs/recipes/llm/grpo-qwen2.5-7b-instruct-4n8g-megatron.yaml (1)

61-61: Adds moe_permute_fusion flag — OK

No concerns.

examples/configs/recipes/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.yaml (1)

45-45: Adds moe_permute_fusion flag — OK

Keeps MoE options consistent across large-model recipes.

.gitmodules (1)

4-4: Submodule branches point to feature branches — please confirm CI reproducibility

Tracking yuya/nemo-rl-use-2 and yifu/nemo-rl-use-chunkpatch-ds can drift. Ensure:

  • Submodule SHAs in the superproject are pinned to the tested commits.
  • Nightly/baseline runs are captured for these exact SHAs.

Would you like a small script to print current submodule SHAs and compare against the bridge PR baselines?

Also applies to: 9-9

nemo_rl/models/megatron/common.py (2)

94-99: Potential mismatch: cu_seqlens_padded[-1] forced without validating total padded length

If pad_packed_seq_to < already-accumulated padded total, cu_seqlens_padded will become inconsistent with tensors.

Apply a guard:

-        if pad_packed_seq_to is not None:
-            cu_seqlens_padded[-1] = pad_packed_seq_to
+        if pad_packed_seq_to is not None:
+            if pad_packed_seq_to < cu_seqlens_padded[-1]:
+                raise ValueError(
+                    f"pad_packed_seq_to ({pad_packed_seq_to}) < total padded length ({int(cu_seqlens_padded[-1])})"
+                )
+            cu_seqlens_padded[-1] = pad_packed_seq_to

229-247: Unpack path ignores CP when cu_seqlens_padded is None

For CP>1 with no per-seq padding, cu_seqlens_padded can be None, and this branch treats it as “no CP,” yet output_tensor is CP-sharded. This will mis-index.

Either:

  • Always set cu_seqlens_padded=cu_seqlens.clone() when CP>1, or
  • Gate unpacking on CP explicitly (use group/world CP metadata), or
  • Document that unpack requires per-seq padding when CP>1.
    Would you like a patch to always clone cu_seqlens when cp_size>1?
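For readers unfamiliar with the packed-sequence bookkeeping discussed above: cu_seqlens is the cumulative-offset vector marking sequence boundaries inside a packed [1, T] tensor. A minimal illustration (not the library code, which builds these as tensors for PackedSeqParams):

```python
def cu_seqlens(seq_lengths: list[int]) -> list[int]:
    """Cumulative sequence lengths: offsets [0, l0, l0+l1, ...] marking packed boundaries."""
    offsets = [0]
    for length in seq_lengths:
        offsets.append(offsets[-1] + length)
    return offsets
```

Sequence i then occupies the half-open slice [offsets[i], offsets[i+1]) of the packed tensor, which is why pack and unpack must agree on whether the padded or unpadded variant is used.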
tools/refit_verifier.py (1)

213-215: Add of moe_permute_fusion looks good

Flag is plumbed consistently with worker usage. No concerns.

examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron.yaml (1)

88-91: Add of moe_permute_fusion is consistent

Matches worker/config plumbing elsewhere. Good default to false.

tests/unit/models/policy/test_megatron_worker.py (1)

98-101: Test config updated with moe_permute_fusion

Looks correct and keeps tests aligned with runtime config.

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron-seqpack.yaml (1)

55-58: Add of moe_permute_fusion is fine

Consistent with other YAMLs and worker mapping.

examples/configs/sft_openmathinstruct2_megatron.yaml (1)

85-88: Add of moe_permute_fusion is fine

Keeps configs uniform.

tests/unit/models/generation/test_vllm_generation.py (1)

168-171: Add of moe_permute_fusion in Megatron test config

Matches policy worker expectations. Looks good.

nemo_rl/models/policy/megatron_policy_worker.py (2)

64-66: FSDP adapter import path change — verify submodule compatibility

Ensure the updated Megatron-LM/Megatron-Bridge pointers include megatron.core.distributed.fsdp.mcore_fsdp_adapter and CI runs Megatron nightlies per reviewer request.

Would you run the Megatron nightly suite and the RL unit tests on this PR branch to confirm the import path exists across all test matrices?


566-567: moe_permute_fusion is correctly wired into model_cfg

Mapping from cfg.megatron_cfg to model_cfg is clear. No concerns.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 66bd252.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 32eaed7.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 26f4f4d.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit b1847d1.

@coderabbitai (bot) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
3rdparty/Megatron-Bridge-workspace/setup.py (3)

48-48: Don’t ship megatron-core’s dev extras in runtime deps.

Pulling dev drags test/build tooling into user installs and increases resolver churn. Keep mlm only; expose dev via a dedicated extras or docs.

-    "megatron-core[dev,mlm]>=0.14.0a0,<0.16.0",
+    "megatron-core[mlm]>=0.14.0a0,<0.16.0",

If the submodule’s pyproject currently includes [dev,mlm], update it in tandem so the consistency check still passes.


16-17: Guard toml parsing for Python <3.11 and avoid hard fail when parser isn’t available.

tomllib is 3.11+. On older Pythons this will crash before we can even decide to skip the check. Add a tomli fallback and skip the consistency check if neither is present.

-import tomllib
+try:
+    import tomllib  # py311+
+    _HAVE_TOMLLIB = True
+except ModuleNotFoundError:
+    try:
+        import tomli as tomllib  # py<=3.10 fallback
+        _HAVE_TOMLLIB = True
+    except ModuleNotFoundError:
+        _HAVE_TOMLLIB = False
@@
-if os.path.exists(bridge_src_dir):
+if os.path.exists(bridge_src_dir) and _HAVE_TOMLLIB:
@@
-    with open(pyproject_path, "rb") as f:
-        data = tomllib.load(f)
+    with open(pyproject_path, "rb") as f:
+        data = tomllib.load(f)
@@
-    else:
+    else:
         print(
             "[megatron-bridge][setup] Dependency sets are consistent with the submodule pyproject.",
             file=sys.stderr,
         )
+elif os.path.exists(bridge_src_dir):
+    print(
+        "[megatron-bridge][setup] tomllib/tomli not available; skipping dependency consistency check.",
+        file=sys.stderr,
+    )

If you prefer enforcing the check, add a conditional dep instead: tomli>=2.0.1; python_version<'3.11' to both this file and the submodule pyproject so the set comparison still matches.

Also applies to: 55-65, 72-101


52-53: Pin a minimal filelock version (and ensure submodule pyproject matches).

Unpinned filelock can resolve very old releases; set a floor for reproducibility and to ensure modern APIs.

-    "filelock",
+    "filelock>=3.12",

Confirm the same constraint exists in Megatron-Bridge/pyproject.toml to keep the consistency check passing.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26f4f4d and b1847d1.

📒 Files selected for processing (2)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge (1 hunks)
  • 3rdparty/Megatron-Bridge-workspace/setup.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
🔇 Additional comments (1)
3rdparty/Megatron-Bridge-workspace/setup.py (1)

47-49: Verify megatron-core version range
The install_requires pin in setup.py (megatron-core[dev,mlm]>=0.14.0a0,<0.16.0) must cover the APIs introduced after commit d07e4e5—if your changes rely on 0.15.x functionality, tighten it to >=0.15.0,<0.16.0.

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
github-actions bot: submodule fast-forward check passed for commit 2999f2b.

@terrykong terrykong enabled auto-merge September 10, 2025 17:43
@terrykong terrykong added this pull request to the merge queue Sep 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 10, 2025
@terrykong terrykong added this pull request to the merge queue Sep 11, 2025
Merged via the queue into main with commit 16d9128 Sep 11, 2025
26 checks passed
@terrykong terrykong deleted the yifu/mbridge_dsv3_mcoretot branch September 11, 2025 04:10
@coderabbitai coderabbitai bot mentioned this pull request Sep 14, 2025
4 tasks
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
…#1059)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…#1059)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Ahmad Kiswani <kiswani.ahmad@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>

Development

Successfully merging this pull request may close these issues.

  • Deepseek v3 migration to Megatron-Bridge
  • Add DeepSeek-V3 CP support

3 participants